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MMSE of probabilistic low-rank matrix estimation: 
Universality with respect to the output channel. 
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Abstract —This paper considers probabilistic estimation of a 
low-rank matrix from non-linear element-wise measurements 
of its elements. We derive the corresponding approximate 
message passing (AMP) algorithm and its state evolution. 
Relying on non-rigorous but standard assumptions motivated 
by statistical physics, we characterize the minimum mean 
squared error (MMSE) achievable information theoretically 
and with the AMP algorithm. Unlike in related problems of 
linear estimation, in the present setting the MMSE depends on 
the output channel only trough a single parameter - its Fisher 
information. We illustrate this striking finding by analysis of 
submatrix localization, and of detection of communities hidden 
in a dense stochastic block model. For this example we locate 
the computational and statistical boundaries that are not equal 
for rank larger than four. 

I. Introduction 

Estimation of low-rank matrices from their noisy or in¬ 
complete measurements is a problem that has a wide range 
of applications of practical interest [1], As for every broadly 
relevant data processing problem it is of interest to study 
statistical and computational limits of such an estimation 
on meaningful model settings. In this paper we evaluate the 
Bayes optimal and computationally achievable mean-squared 
error of estimation for the following two models: 

In the first model the matrix to be estimated is created as 

W = -~'=XKX T , (1) 

\Jn 

where I is a n x r matrix whose rows x, were chosen 
independently at random from some distribution P pr ior(^i), 
and K is a r x r symmetric matrix. The matrix W is 
then observed element-wise trough a noisy non-linear output 
channel P ou t{yij\wij ). with i,j = 1 ,...,n. The goal is to 
estimate the unknown matrix X from measured Y. 

We consider the problem in the limit of very large systems 
n —X oo, small rank r = 12(1), and dP ou t/dw = 12(1). The 
purpose of the scaling factor 1 / yfn in (|T|) is that the inference 
problem is neither trivially easy nor clearly impossible in 
this limit. The same model and scaling was considered 
e.g. in [2], [3], where the matrix K was an identity. The 
main algorithmic difficulty in such a setting comes from 
the high dimension n. In this work we study the idealized 
setting in which the hyper-parameters r, K , P pr i or , P ou t are 
independent of the dimension n and known. Note, however, 
that extending our approach via expectation maximization 
seems like a natural way to learn these hyper-parameters. 
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In the second model the matrix to be estimated is created 
as 

W=-=UV T , (2) 

\Jn 

where U is a n x r matrix and Tisamxr matrix, with rows 
Ui (resp. i>i) chosen from some distribution P“ rior (ui) (resp. 
Pprior (pi))- The matrix W is then observed trough the same 
output channel P out ('jjij | ) as above, and the problem is 
set analogously, adding that m/n = a = 12(1). Most of the 
discussion in this paper uses notation of the first model, but 
all our results are relevant for the second one as well. 

An algorithmic tool that first comes to our mind when 
seeing the setting above is singular (or eigenvalue) value 
decomposition (SVD) keeping the leading r components. 
SVD is the optimal estimation algorithm if the goal is to 
minimize the mean squared distance between the observed 
matrix Y and the estimator irrespectively of the properties 
of the factors (such as the prior P pr i or )- Requirements on 
the factors, such as sparsity or other structure are not 
incorporated in the basic SVD. Moreover, for a general non¬ 
linear output channel P out the sum of squares between Y and 
its estimator is not the most relevant quantity to minimize. 

Setting the problem in a fully probabilistic way is an¬ 
other common approach [4], that is in principle much 
more flexible, but algorithmically more challenging in gen¬ 
eral. To obtain a Bayes-optimal estimator of the factor Xi 
(r-dimensional column vector) we need to compute the 
marginals of the posterior probability distribution 

P{X\Y) = 2(Y\ U pprior i x i) JJ Pout {Vij \Wij ) , 

' l<z<n 

(3) 

where = xfKxj /y/n. In this paper we will leverage 
this algorithmic difficulty by realizing that techniques based 
on approximate message passing and related state evolution 
(SE) are asymptotically optimal for the above setting. For 
compressed sensing these techniques are rigorous thanks 
to series of works [5], [6], The AMP algorithm and its 
state evolution for the above setting is different from the 
one of compressed sensing, and these proofs do not apply. 
Arguments for the optimality in the present setting come 
from non-rigorous methods of statistical physics. From a 
mathematical point of view the present paper provides a 
set of accurate conjectures. Fully rigorous proof of these 
conjectures is a natural direction for future work. 

A. Examples 

The above model includes a number of examples that are 
commonly considered in the literature. Without trying to be 



exhaustive, we list a few interesting ones. 

• Sparse principle component analysis as considered in 
[2], [3]: The prior distribution P prior has a weight 1 — p 
on a vector of zeros. The output channel was considered 
as additive Gaussian noise. 

• Robust PCA [7] in which the measured matrix is a 
low-rank matrix plus a sparse large noise: The output 
channel adds small noise with some probability p and 
large noise with 1 — p. 

• Submatrix localization [8], [9]: The matrix W includes 
r submatrices (overlapping or not) that have larger mean 
than the overall mean of the matrix W. The prior x t 
then encodes in a binary manner to which of the r 
submatrices does a given variable i belong. The output 
is usually considered as Gaussian additive noise. 

• Detection of communities hidden in dense networks: 
Stochastic block model is popular for theoretical stud¬ 
ies of clustering. Nodes belong to different clus¬ 
ters/communities, the prior P pr i or then allows only 
vectors having one component 1, and 0 elsewhere. The 
observed matrix Y is binary and the probability to 
observe y. (? - = 1 is given by K a jj if Xi(a) = 1 and 
Xj(b) = 1 (i, j = 1,..., n, and a,b = 1,..., r). 

• Biclustering is a simultaneous clustering of rows and 
columns, it finds a number of applications e.g. for 
analysis of microarray data in genomics [10]. Again 
the prior encodes affinity to a cluster and the output 
function includes various models of noise. 

• Poisson noise in the matrix factorization was considered 
e.g. in [11], 

• The labeled stochastic block model is another case that 
can be reformulated in the present setting [12]. 

B. Contribution and closely related work 

As far as we know the only tools that provides asymp¬ 
totically exact analysis of the minimal mean squared error 
of models m is approximate message passing and state 
evolution as deployed in the present paper. For the output 
channel being additive Gaussian noise this was done previ¬ 
ously in [13], [2] for rank r = 1 with part of the results 
being fully rigorous, in [14] for generic rank and without 
the state evolution, and in [3] for general rank with the state 
evolution, but non-rigorously. The main contribution of this 
paper is the treatment of the case of general non-linear output 
channel P out . 

Approximate message passing and state evolution for 
a generic output channel P out was derived previously in 
the context of linear estimation [15], and later in matrix 
factorization with r = f l(n) [16]. In both these cases the 
resulting equations (both the AMP and the state evolution) 
are considerably more involved than those for additive Gaus¬ 
sian noise. In the setting of low-rank models above the 
situation is remarkably simpler as the AMP algorithm stays 
the same up to a change of the matrix Y for the so-called 
Fisher score matrix S that depends on the output channel and 
on Y element-wise, and the inverse of the Fisher information 
of the channel that we denote A and plays a role of an 


effective noise variance. In the state evolution the situation 
is even simpler in the sense that only the effective value A 
of the noise appears. 

The space of all the possible element-wise output channels 
hence reduces to one dimensional space, parametrized by 
their inverse Fisher information A. The resulting asymptotic 
MMSE depends only on A and not on other details of the 
channel. Consequently, classes of rather differently looking 
matrix estimation problems all share the same single-letter 
characterization. The contribution of the present paper is 
to unveil and quantify this property, and illustrate it on 
the example of detection of communities hidden in dense 
networks that is via this one parameter mapping related to 
localization of submatrices having different mean from the 
background matrix. 

Analogous universality with respect to the output channel 
was observed in [17] (see e.g. their remark 2.5) in the 
study of detection of a small hidden clique with approximate 
message passing. 

Our results for the Bayes-optimal estimation error of 
community detection in dense stochastic block model are of 
independent interest. Analogous results were derived for the 
sparse case in [18]. In the dense case only MSE-suboptimal 
spectral methods were evaluated [19]. We also unveil a hard 
phase existing in this problem for rank r > 4, and becoming 
very wide for r —> oo. 

We also recently learned about independently ongoing 
work of [20] who consider rank r = 1 with Radamacher 
prior, and establish rigorously the relation between the 
stochastic block model with two groups and low-rank es¬ 
timation with Gaussian channel. 


II. AMP AND STATE EVOLUTION 
A. From belief propagation to AMP 

In this section we derive the AMP algorithm to compute 
marginals of the posterior probability distribution (|3j. We 
present the derivation for the XI\X T case, for the UV 1 
everything works analogously and we only state the results. 
For convenience we introduce a function g(y, w) as 

PoutfoM = . (4) 

We require g(y. w) to be differentiable in w. For the 
previously considered Gaussian additive noise we have 
<7gan(s/) w) = -(y - w) 2 / (2A). 

In the first step, we write the belief propagation equations 
for the probability distribution 0 For this we introduce 
messages mZ^fxi) and (xj) between variables Xj and 
the factors associated to y,; ? . The BP equations read 


= W— [ d x l n t l ^ il {x l )e 9 (. yil, v* Xl Kxi ) 




1 

j\ x i) — ~yt -‘prior 


(x.i) mh^ixi), (5) 


where Z* l ^ i and Zj^- are normalizations. The main as¬ 
sumption behind these BP equations is that the messages 
ni^u{xi) and (xf) can be interpreted as probabilities 




that are statistically independent conditioned to the values 
of the variable cc, in the large n limit. This is also the 
main assumption of this paper. Arguments of theoretical 
statistical physics [21] provide heuristic justification for this 
assumption. In the above form the BP equations are not 
useful for implementation. We, however, realize that we can 
rewrite them into a much simpler form that provides the same 
marginals in the limit of large -re¬ 
in the large n limit the function g (yu,xj Kxi/^/n) 
depends only weakly on x t and Xi . Therefore we expand 
this function around w = 0 and <[5]» then becomes 


_ 1 | 1 dg 

exp(g(yu,0)) yfn dw yu ,o 

— h(yu)xjK 


a \^,u Kxi+ 
) Kxi + O 


( 6 ) 


where 


Kyu) 


d 2 g 

dw 2 Vil ,0 


di y 

dw yafij 


(7) 


Here we introduced quantities a^_ ¥il and u* ^ as the mean 
and covariance of We then take the logarithm of ([6]) 

and by using ([5]) we find 


= P P rior(xi)e B '^ Txi 


( 8 ) 


where 


nt = K d 9 
^ ij ^ dw 


Vil, 0 


aUu, (9) 


fit = K 
n 




dw 


Vilfi 




h (»«) + a Uu a Uii r ) } k ■ (!0) 


i&j 


To close these equations we finally compute the new mean 
and variance of the messages • as 

(ID 

= (fQ (AUijiBUij), (12) 


where f(A, B) is the mean of the normalized probability 
distribution 


P{x) 


Z(A B) ^ > p ri0r ( :E ) exP ( B x 



(13) 


and df/dB is its covariance matrix. Above we closed the 
intractable distributional belief propagation equations on the 
means and variances of the messages. 

As we can already anticipate it will be instrumental to 
introduce the so-called Fisher score matrix (evaluated at w = 
0) as 


Sij = 


d log P out (yij\w) 


dw 


Vij , 0 


(14) 


and the Fisher information (evaluated at w = 0) of the output 
channel as 


1 

A 


— ®‘f’out(y|l«=0) 


/ <9 log P out (t/ w) 

)1 

V dw 

y,oJ 


(15) 


We denote the inverse Fisher information by A because that 
would be the noise variance for the Gaussian additive channel 
(in that case S = Y/A). 

Using j dyP ou t(y\w) = 1, \/w and (jdj), it follows that 


E 




dg_ 

dw 


y, o 


E 


-Pout(y|™=0) 


'(dg_ 

V 

l 2 +^ 


\dw 

y,0y 

dw 2 

y, o 


= 0 , 

= 0 . 


With the above, eqs. (fTO} can be rewritten 


r>t 


I\ 


- S U a Uil , 


tyi,3 


A 1 = K _ 

nA 


E t t T 


K. 


(16) 

(17) 


(18) 

(19) 


Here we can recognize AMP equations derived and studied 
for the low-rank matrix estimation problem with additive 
Gaussian noise in [13], [14], [2], [3]. 

Remarkably the non-linear output channel enters only 
trough the value of the effective noise (15 i and an effective 
form of the observed matrix (14 1 . This is much simpler than 
what happens for the linear estimation model for which the 
generalized output AMP was derived in [15], or the matrix 
factorization [16]. 

Finally we provide a simplification that is standard to 
AMP-like algorithms and that could be called TAPyfica- 
tion [22]. We notice that variables and depend 

only weakly on the index j, and this allow us to reduce 
further the number of variables to end up with 



yi a Wi 

1<K« 


K , 


( 20 ) 


v l^i \ KKr 

d 


Ka\ 


u* +1 = 


dB 


(A\Bl). 


( 21 ) 

( 22 ) 

(23) 


Where the second term in the expression for B\ is the 
so-called Onsager reaction term, with its time index one 
iteration earlier, as is usual in AMP-type algorithms. Here 
we have reduced the number of messages to iterate from 
0(n 2 ) to 0(n). 

The procedure carried out above to derive the AMP algo¬ 
rithms can also be used to obtain a corresponding expression 
for the log-likelihood (j> = log {Z(Y))/n where Z(Y) is the 
normalization in 0 This is called the Bethe free energy. 






























Given a fixed point the AMP equations the Bethe free energy 
reads (for details see Appendix B) 

</>=^ E loglog (Zi), (24) 

l<i<n l<i<n 

where Z(A,Bi ) is the normalisation from © and 


log (Zi) = Tr 


o-jK y’ c 

V l<i<n 


x Kaial K 

cla — 


An 


E : 

l<j<n 


+ 2vj 


(25) 


This was derived in Appendix B. 


B. Summary of the algorithm 


For a given output channel we first need to evaluate the 


effective noise parameter A (15 i and the score matrix S ( 14 1 . 
Then at each iteration of the AMP algorithm we store the 
following variables. 

• a\ and a\~ 1 are the estimators of the mean of the 
variables X, at time t and t — 1. Every a* is a vector of 
size r x 1. 

is the estimator of the covariance of the variables x, 


v. 


t ; 


at time t. The v, are matrices of size r x r. 


We then compute the matrix A* and the vectors If with (20 1 
and (21 1 . Every B\ is a vector of size r x 1 while A 1 is a r x r 
matrix. Finally we compute the new estimate of the mean 
and covariance of variables x t from ( |22[ > and (|23j. Where 
the function / is defined as mean and covariance of the 
distribution in eq. (13 i. We initialize the a 4=0 very close to 0 


in order to avoid convergence problems. To help convergence 
we can also damp the iterations with some parameter 7 . More 
sophisticated damping should be implemented if convergence 
problems persist on non-synthetic data [23]. This finally 
gives us the following algorithm. 


Algorithm 1 AMP for low-rank estimation 

1: X ■<— 10 _10 Normal(n, r) 

2: A 0 id <- Zeros (n, r) 

3: v <— Zeros(r, r) 

4: diff <— 1 

5: while t < t max do 

6: B «— 4 =SXK - X ° ld KvK 

y/n n/A 

7: A <- KX l* K 

8: AN ew <T- /(A, B) 

9: WNew B) 

10: A 0 id <— X 

11: X <— (1 — 7)2LNew + jX 

12: V <- (1 - 7)u N ew + 7^ 

13: diff norm(X — A' 0 i d ) 2 /n 

14: if t > t m and diff < 10 -6 then 

15: break 

16: return X 


C. Remark about spectral algorithms 

An interesting remark can be done when noticing that 
there is an intimate relation between AMP and SVD. First 
notice that for some prior distributions AMP has a so-called 
uniform (or factorized) fixed point where the values of a,; do 
not depend on the index i (e.g. for zero mean priors). In such 
cases it is meaningful to linearize AMP around this uniform 
fixed point. This linearization can then be interpreted as a 
spectral algorithm. For the most common additive Gaussian 
output channel this gives the eigen-decomposition of Y 
(or SVD of Y for the uv T case). Of course this spectral 
algorithm comes with its advantages (non-parameteric) and 
disadvantages (hard to ensure constraints on the factors X). 

For a generic output channel the object that comes out 
from the AMP linearization is the Fisher score matrix S 
defined by ( [T4| . Following the analogy, the SVD decom¬ 
position (or eigen-decomposition) should hence be done on 
S rather than on Y. Let us give an example in which the 
difference between doing SVD on F or S' indeed improves 
performance of the spectral method. We consider rank r = 1 
and Pprior(^) = e.~ x / 2 The output channel is an 
additive exponential noise of parameter A, i.e. P out (y|u>) = 
exp (—| y — w\/\) /(2A). The score matrix S is in this case 
proportional to S « sign (yij). And indeed for l/\/2 < 
A < 1 an informative eigenvector gets out of the bulk of 
the spectrum of S but not out of F’s spectrum. Taking the 
score matrix allowed us to extract meaningful eigenvectors. 


D. State evolution 

Remarkably, the AMP algorithm is amenable to asymp¬ 
totic analysis. In statistical physics this would be called 
the cavity method [ 21 ], in linear estimation the commonly 
used term is state evolution. The SE was derived for low- 
rank matrix estimation for the Gaussian additive channel 
in [13], [2], [3] and the present case is a straightforward 
generalization. To write the SE for the present case we 
introduce two order parameters of dimension r x r. 

Q t = lT, > Mt = l E < xtT i > ( 26 ) 

l<i<n l<i<n 


where X is the vector we are trying to infer and that is 
unknown to us. State evolution (or single letter character¬ 
ization) relies on the computation of the distribution (with 
respect to the realizations of the matrix Y) of B\ given X 
(A 4 is a deterministic variable of Q 1 ). 

Using (10 1 we notice that P 4 is a sum of many terms that 
by the assumptions of the belief propagation are independent. 
Using the central limit theorem we can characterize B\ by 
its mean and variance as done in Appendix A. As a result 
one finds that 

KM ^ + (., (27, 


If = 


where f is a Gaussian random variable of mean 0 and of 
covariance I\Q t K/A. This allows us to see that the order 




















parameters Q l and M 1 will evolve as 

Q t+1 = Ep x ,p e [/(^, + C)/ T (.,.) 

M*+! = Ep xiP [/(*£* 


(28) 

(29) 


where we used abbreviations for P x = P pr i or (x) and /(t = 
A/"(0, KQ t K/ A). The remarkable part of these equations 
is that all the details of the output channel have disap¬ 
peared and the only thing that remains is the inverse of the 
Fisher information A. Larger A means larger effective noise, 
smaller Fisher information and hence harder inference. The 
estimation error in the large n limit depends on the channel 
only trough its Fisher information. 

Since in this paper we assume the knowledge of the 
generative model and its parameters the state evolution 
simplifies further as we observe that Q f = M 4 . This is called 
the Nishimori condition in statistical physics and was derived 
e.g. in [16] or [2], 

State evolution also allows us to derive the Bethe free 
energy in the large n limit as 


= E 


Px,P e 




-pi\{KMKM t ) + j^T \{KQKQ t ) . (30) 


This formula is useful when there are multiple stable fixed 
points of equations ( [28| and (291. In such a situation the one 
with the greatest log-likelihood f is the Bayes optimal one. 

E. AMP for UV T decompositions 


In this section we complete the presentation of the AMP 
algorithm and the state evolution by stating it for the UV T 
model |2|. We remind that U and V are of size nxr and m x 
r respectively and that we consider the following limit, n —> 
oo, to = an, a = f2(l), r = w(l). Each row of U and V has 
been sampled from a probability distribution Pf Tmr (u.j) and 
pp ri °r( t! .) (f or s j m pii c ity 0 f notation we will omit the upper 
index ’’prior” in this section). Distributions P u and P v have 
their corresponding input functions f u (A,B), f v (A,B) and 
their normalizations Z U (A, B), Z V (A, B ), defined according 
to m The matrix W is observed through an output channel 
defined by some probability P 0 ut(yij\u>ij)- 

The AMP algorithm for estimating U and V from Y is as 
follows 


m / m 

-®«,i = S S ik vl - ^ ( E 

v 1 1 \ U 1 


k =1 

\t _ _ 

u n A 


Ai = A J2 v\v 


k=1 


k—1 

fT 

k ’ 


up = uk, Bi.) 

T *+i 


5+1 _ ( (At p>t \ 

■i,i ~ \dB J ’ 

n / n 

^v,j = ^ S Slj u l ~ ( S a u,l 


1=1 
1 1 


Kl=1 

1 ^ r t..tT 

1=1 


^ = u \ u \ > 


V,J 


1 + 1 -(Iff) (-M, 


a v,j 


1 


(31) 

(32) 

(33) 

(34) 

(35) 

(36) 

(37) 

(38) 


The score matrix S and the effective noise A were again 
computed from the output channel following ( fl4| ) and ©■ 
The Bethe expression for the log-likelihood is obtained 
from the fixed point as 


.j n m 

f = n {E log l z u(A u , B Ut i)\+^2\og [Z v (A v ,B vJ )\ | 


3=1 


E !°g (Zij) > (39) 


where Z(A,Bj) is again the normalisation from © and 

1 °g(Zij ) = Tr 


Ui „ T u i u i av,j + VjV 1 Ou,i UiUi T 
-j=SijVj -T-TwV V j V j 

fn J An 2A n J 

(40) ' 


To write the state evolution equations we introduce the 
order parameters 


Qu = l E > M u = l E u * u oJ > 


l<i<n 


1 <i<n 


Ql = - E M v = - E 

m L ^ J J m z ' 


v j v 0j ■ 


l<j<m 


l<j<m 


(41) 

(42) 


for which holds 


V'=E V(v [. 

+ /(;■)} 

, (43) 

M‘ +1 = E p „ o ,p £i 

°puo + \/a£v)u/^ . 

, (44) 

Ql +1 = e +v 0 p 5 , 

„ (.,.)] , 

(45) 

M‘ +1 =E p „ o: 

,p e . [M $+ , 

(46) 


where f a and are Gaussian random variables of mean 
0 and covariance C/J A and Q/J A. At the fixed point we 
compute the Bethe free energy as 




log Z u ( 


aQ v aM v uo 

"A - ’ A 


+ s/at; y) 


+ Ep„ 0 ,p u logZ v ( — , — 

_ aTr(M u Mj) aTr (Q U Q/) 

A ' 2A ’ ^ 

III. Phase diagrams and examples for clustering 


In this section we illustrate our findings on the example 
of clustering into r equally sized groups. In this case the 
r-dimensional rows of the matrix X take the form of 
(0 ,..., 1,..., 0) with a 1 located at only one of the r 
coordinates. The matrix I\ is an identity. The elements of 
the matrix W are then w.^ = 1 /y/n if Xi = Xj and 
Wij = 0 otherwise. Correspondingly the prior P pr ior(x) is 
the probability distribution that picks with probability 1/r 
one of the above r vectors. The function f(A, B) defined 
via is in this case 




exp [Bi - 4f) 

E ex P ( B k - ) 

l<k<r 


( 48 ) 





























We will consider two different output channels 

• Gaussian additive noise P 0 ut,(yij\ w i:i) = e x p[— ( 2 /ij — 
Wij) 2 / (2A)]/v / 27rA. In this case the problem is inter¬ 
preted as localization of submatrices having mean 1 / \jn 
in an otherwise zero-mean matrix. 

• Stochastic block model channel where Y is its (dense) 
adjacency matrix created as P 0 ut(Pij = M w ij) = 

Pout + P-Wij and PoutiVij = 0|t%) = 1 ~ Pout - P-Wij. 

That is, nodes from different groups are connected with 
probability p out , and nodes from the same group with 
probability p in = p out + p/y/n. Both p out = 0(1), 
and p = 0(1). The Fisher score matrix has elements 
Sij = n/pout where Yij = 1 , and = -p/(l-p out ) 
where Y l3 = 0. The inverse Fisher information of the 
stochastic block model channel is given as 


A = 


Fout(l Pout) 

p 2 


(49) 


In the large n limit the resulting graph will have average 
degree np aut . 

Bayes-optimal inference in the stochastic block model was 
studied in the case of sparse networks (bounded average 
degree) [18]. For the case of dense networks (linear average 
degree) and p = 0(1) the Bayes-optimal MMSE and 
corresponding phase transitions are an original contribution 
of the present paper. 

A. MMSE and phase transitions for the symmetric clustering 


The state evolution from section II-D simplifies for the 
case of symmetric clustering in the setting above. One notices 
that if Of and M 1 are of the symmetric form 


Q t = M t = ~^J r + -I r 


(50) 


then Q t+1 and M t+1 will be of the same form. Here J r is 
a r x r matrix filled with ones, and I r is a r x r identity 
matrix. Moreover since the sum of elements of the vectors 
x is always 1 we have the additional property that 


6 * = 1, 0 < a*, 6* < 1. 


(51) 


Therefore in the end we only have one parameter b in 
the state evolution, 6 * = 1 means that we have perfectly 
reconstructed the communities, b f = 0 is when there is 
no information and the estimator for every variables is 

(~ ■■■ i) 

V r ’ 5 r / 

The state evolution equation ( [28] ) can then be used to 
derive the evolution of the parameter b under iterations. It 
follows that 

' b*' 


b t+1 =M r |v 


(52) 


where the function A4 is 


M{x) = 


r — 1 


/ 


exp (* + Ml) + J2 ex P ( u 0 i=1 

i= 2 

( 53 ) 


where we introduced a Gaussian measure 


T>Uj = du,- 


: exp 


o 

—rut 


2x 


( 54 ) 


We now observe that the above state evolution equation 
has always the uniform or uninformative fixed point b u = 0 . 
It is crucial to study the stability of this fixed point under 
iterations of ©• Depending on the number of groups r 
and the noise parameter A there may be another stable fixed 


point of (52). When it exists we will call it /; f . ir (it obviously 
depends on A and r). We expand (53 i around b u to get 

r — 4 , 


0 ( 6 ‘ 3 ). 


ht 

b t+1 = — 

A r 2 

From this equation we see that for 

A A 1 

A < A C = ~2 

the uniform fixed point will be unstable and the iteration will 
converge away from it. 


(55) 


(56) 


If we translate condition (56 1 back to the parameters of the 


stochastic block model we obtain that inference with AMP 
is possible for 


Pin Pout | P* 


/—r \JPout (1 Pout) ■ 

sjn 


(57) 


This is the same condition as known from the sparse SBM 
[18] and from standard spectral methods [24], [19]. Our 
contribution to these results is the Bayes-optimal value of 
the detection error, which is considerably better than the 
error obtained from the spectral algorithms evaluated in [19]. 
Note e.g. that the error in the spectral algorithm is always 
continuous at the transition, whereas the Bayes-optimal error 
presents a discontinuity at the transitions for r > 4 (see 
below). 


By looking at the second order term in (55 i we can get 
some information about where the iterations will converge. 


If r < 4, 


2A 2 r 4 


6 * is negative and close to A c we 


will converge to 6 f ar ( A ) = 2 (A C — A)/(4 — r) + 
0[( A C — A ) 2 ]. In this case the fixed point is a continuous 
function of the noise. 

• If r >4, things are different. Because the second 
derivative of ( [55} is positive then it is impossible for 
6 f ar to go to zero continuously as A goes to A c (a 
simple plot of ( [55] ) should convince the reader of this 
fact). There is therefore a jump in the value of the fixed 
point as A crosses A c . This is the signature of a first 
order phase transition. 

In the case of first order phase transition there are always 
three transitions to discuss 


Ac — 2 5 A s t a tic 5 A, 


spinodal • 


(58) 


The different phases have the following properties: 

• Undetectable phase without clusters, A sp ; noc } a i < A. 
Eq. ( [52} has only the uniform fixed point. Bayes optimal 
inference does not provide any information about the 
labeling of the nodes. 


















• Undetectable phase with clusters, A stat i c < A < 
A S pi n odai- Eq. (52 1 has two stable fixed point b u and 
6 f ar . But 6 f ar has a lower log-likelihood than h u there¬ 
fore the Bayes-optimal estimator is still not correlated 
with the planted solution. 

• Detectable but hard, A c < A < A stat ; c . The fixed 
point 6 f ar now has a larger log-likelihood than b u . Bayes 
optimal estimator will find a configuration well corre¬ 
lated with the planted one, but the AMP algorithms will 
not. Analogous phase appears in many other inference 
problems and we conjecture that all known polynomial 
algorithms will fail in this phase. 

• Easy phase, A < A c . The uniform fixed point b u is 
unstable and the AMP algorithm will converge to 6 f ar . 
The problem is said easy since we have a polynomial 
algorithm that (at least is conjectured to) asymptotically 
reaches optimal reconstruction. 

A theoretical analysis of A spinoda i(r) and A stat i c (r) is 
also possible as r -> oo. It relies on analysis of which 
exponential dominates the others in ( |53) , for details see 
appendix D. We find 

^spinodal — Tv r^73-[l + o(l)], (59) 


A s 


2 r ln(r) 
1 

4rln(r 


[l + o(l)], 
-[1 + 0 ( 1 )]. 


(60) 


The most notable remark about these results is that in the 
limit of large rank the gap between what is statistically possi¬ 
ble A sta ti c and what is algorithmically tractable A c = 1/r 2 
is different even in its order. Note also that the present phase 
transitions are equal to critical temperatures in the Potts 
glass model as computed in [25], [26], this is because of the 
intimate relation between random and planted models [27]. 
Note that these works also derived rank r = 4 as the point 
where the second order phase transitions changes into a first 
order one. 


B. Phase diagrams 

In this section we illustrate the behavior of the fixed points 
of the AMP algorithm and of the state evolution. In all the 
presented experiments we iterate the corresponding equations 
till convergence. To investigate the MMSE and the phase 
transitions we can initialize the AMP algorithm and the SE 
equations in two different ways: 

• Uniformative initialization: for the SE equations this 
means b t=0 = 6 , where S is very small. For the AMP 
algorithm this means aj = (-,..., ^) T + Si and v\ = 
I r /r. This is the initialization with which we would 
work if we were working with real data. 

• Informative initialization: for the SE equations this 
means b t=0 = 1. For the AMP algorithm this means 
a* = Xi and v\ = 0 , i.e. initiate the estimators to 
be equal to the planted solution that we are trying to 
recover. 

In community detection we usually evaluate an estimator 
that maximizes the number of correctly assigned nodes. For 
this we need to take the index of the maximum component 


of ctj. For numerical comparisons between the state evolution 
and the AMP algorithm we will rather evaluate the mean- 
squared error MSE = 1 — Q, where Q is the order parame¬ 
ter ( |26l >. 

In figure [T] we illustrate the fact that in the large n limit 
different output channels having the same inverse Fisher 
information A have equivalent performance. We plot the 
MSE obtained from the state evolution, compare to the one 
obtained from the AMP algorithm on a single instance of size 
n = 20000 with either a Gaussian additive noise channel or 
the stochastic block model channel. Indeed the three cases 
agree. 



Fig. 1. The MSE of the reconstruction of communities in the rank r = 2 
case. We present the result of AMP for two different channels the Gaussian 
additive noise (GAN, red crosses) and the stochastic block model (SBM, 
with pout = 0.5, green crosses) compared to the theoretical value from the 
state evolution (SE, blue line). The AMP simulations were run on a single 
instance of size n = 20000. 


In figure [2] we depict the first order phase transition that 
occurs for r > 4. We present the MSE obtained from density 
evolution as function of r 2 A for various ranks r = 10,15, 20. 
From the uninformative initialization the MSE jumps to 
1 - 1/r at A c = 1/r 2 . From the informative initialization 
the jump occurs at A sp i noc j a i > A c and by comparing 
the Bethe free energy of the two fixed points we evaluate 
the A stat ; c . We observe that already for these values of 
the rank the gap between the information theoretical and 
computationally tractable performance, i.e. between A c and 
A s t a ti c is very large. Another interesting observation we 
made concerns the value of the MSE at the A c = 1/r 2 
transition. The reconstruction is very close to exact even 
at the phase transition itself. More precisely it decreases 
roughly exponentially with the rank r, being close to 10“ 10 
for rank r = 100. 

In figure [3] we plot the values of A spinodal and A stat i c 
rescaled by rlogr as a function of the rank r. As the rank 
grows r —> oo the static line will converge to 1/4 and the 
spinodal one to 1/2 ( 59||60 i. Whereas the static transition 
converges nicely to its asymptotic value, for the spinodal 
transition we are still quite far from the asymptotic regime. 

IV. Conclusions and perspectives 

We considered here the problem of estimation of a low 
rank matrix that was observed element-wise trough an arbi- 
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Fig. 2. We plot MSE deduced from state evolution (lines) and from AMP 
(marks) for different values of rank r as a function of Ar 2 (this rescaling is 
for visual reasons). The vertical full black line is A c for all the cases. The 
vertical dashed colored lines are A s t a tic and the full lines correspond to the 
MSE obtained from the informative initializations and have discontinuities 
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Fig. 3. We plot Ar log r for the static and spinodal phase transitions ob¬ 
tained from the state evolution using the protocol described in Appendix C. 
We rescal e th e A in this way to compare with the large rank expansion in 
{59} and {60}. 


trary noisy channel. Using approximate message passing and 
its state evolution we compute the Bayes-optimal MMSE 
in this problem. The most interesting and, comparing to 
previously studied cases, surprising conclusion is that the 
MMSE depends on the output channel only trough its Fisher 
information. 

As an example of the considered setting we study the 
problem of clustering in dense stochastic block model with 
the difference between probabilities of connection within and 
between communities scaling as 0(l/i/n). We evaluate the 
phase transitions and MMSE in this regime, unveiling a wide 
region of computational hardness. 

Both the AMP algorithms and the state evolution apply to 
a number of other setting considered in the literature, which 
is a natural direction of future work. 
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Appendix A 

To derive the state evolution let us define 


Gii = —i=Sna\_ 
Jn 


■il • 


(61) 


We then have Bj = G,j . We use the central limit theorem 

l^i 

to find the distribution of Bi in order to do that one needs 
to compute, E y{Gu) and Cav(GuGj k ). Let us show how to 
compute the mean 


E(Gii) = f dyP out ( y\wu ) - 7 = 77 ^ 
J Jn dw 


v, 0 


„ a Z-uZ- 


We then expand the probability P 0 ut(y\w) around 0 


-Pout ( y\w) = P ou t(j/|0) ( 1 + w 


V,0 


Using (63 1 in (62 1 and using © and © we get 

E(Gjz) = — djxjKxi + O (™ - ®) , 


(62) 


(63) 


(64) 


where ai is the mean of a^u with respect to y,j. Using (63 1 
and the fact that a^u and a k ^ik are independent random 
variables we can compute the second moments of the Gij as 


p n . r (f~< pr \ _ tk^ a i a i P 

Uov(Gii,Gifc) — d t - — -, 


From the central limit theorem we deduce. 

bI„k*k« + w „ 
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(65) 


( 66 ) 


$ = ruj) = ^ F, - ^ Fij , 

l<i<n l<i<j<n 


where 


F* = log 


J' d^i-fprior n 


(67) 


( 68 ) 


and 

F ij = log 


J dxidxjPout {yij\wij) ni^ijixjnj-yijixj) 


Using we find 

E Fj = ^ log Z(A, Bi). 


(69) 

(70) 


For the second term we expand once again P ou t, (yij\ w ij) 
around 0 and then take the mean with respect to each 
message. We then take the log and find 


a - Kdj 


g., /;(//,p0) + ' v „ ' P 


S' 

-liTr 

2 n ±L 

HVi: 


ij 

Ka i ' Ka) „//' 


iTr 
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+ a\ 


X K («[_,„+at- 


(71) 


The TAPification procedure that is not detailed here gives 
us 


Osi—tij — 


Sjj a j + q(}_ 


(72) 


We also remove the term g(yij, 0) since it is a constant that 
does not depend on the messages. We replace ( 72 1 , ( flTf i and 
© in © and we get 


F-. — $ ■_— Tr 

Jn lJ 2nA 


Ka\a\ Ka)a) 


Tr [KaiaJKvj + KdjdJI\Vi] . (73) 

Since the sum is on all undirected links ij we do the sum 
on all directed link and divide by 2 


$= © log Z(A,Bi)- 


1 <i<r 


\ E 


Tr 


2nA 


where W t is a Gaussian random variable of mean 0 and of 
covariance KQ t Kj A. 

Appendix B 

The Bethe free energy in the case where all the factors are 
between two variables reads as 



KalaP Ka t jOj T 


J E 

. 1 < z<r; 


(74) 


To get the Bethe-free-energy in the large n limit we 
compute the mean of with respect to the noise. 

The main idea remains the same we compute the mean with 
respect to a perturbation of P out 


jg, ( 


dyP O nt(y\0) (1 + SijWij) lJ l ^ l] J ,lJ 
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SuaJ.^Ka 


Jn 


KxidJ KdjxJ 

nA 


(75) 


(76) 


To compute the mean of ( [70]) we use the fact that we know 
the distribution of the Bi ( 661 . We replace ( |76| ) in (71 1 and 
we get 


l<i<n i 


* = %,iv [log h r + w) 

-^Tr (I<MKM t ) + j^Tr (KQKQ 1 


( 77 ) 









































Appendix C 

To locate the phase transitions A sta tic and A sp ; no< iai we 
make a couple of remarks about the state evolution. First we 
remark that for \/x £ M + b = M r (x ) is a fixed point of (52 1 
for A = M r (x)/x. By definition A sp i no d a i is the greatest 
A for which fof ar exist. Therefore 


A sp i n odai — max 
a;eR + 


M r {x) 


(78) 


To find the A sta ti c one can notice that by taking the derivative 
with respect to Q and M of ( |30| one finds a combination of 
and ( |29| . Therefore we have 

</>(6 2 , A) - A) = J duM r -u. (79) 


We deduce a way to compute A sta ti c as 


I M r (x) f . . xMr(x) I 

Astatic = max < --, s.t. / AuMr{u) = --- > . 

iei+ \ x J 2 

(80) 

To compute A sta ti c we evaluate AT, oil a whole interval then 
for each x draw a line between point (0, 0) and (x,H(x)). 
We then compute the area between A4,. and this line. When 
this aera is zero then A4 r (x)/x gives us A sta ti c - 

Appendix D 


In this appendix we explain the derivation of (59 1 , (60 1 . 
To do this let us study the function A4 r (x) where we take 
x = /3rlog(r), where ft = 11(1). The important part of Air¬ 
is this integral 


exp (f + ui) 


exp (f + mi) + X) exp (m,;) i= i 

i=2 




(81) 


The important variables to look at are 

F 1 = exp + Mi) , 

(82) 

r 

F -2 = F exp (Mi). 

(83) 


i=2 


If F\ dominates f 2 as r —> +cxd then A4 r = 1, otherwise if 
F 2 dominates F\ then AI r = 0. 

To estimate F\, F 2 let us notice that with high probability 
the maximum value will be of order \/2/3 log(r). This is 
a general property of Gaussian variables that the maximum 
of r independent Gaussian variables of variance a 2 is of 
order cr^/2 log(r). We can therefore compute the mean of F 2 
all while conditioning on the fact that all of the m are smaller 
than \j2fi log(r). This allows us to compute the typical value 
of F\ as F\ ~ r&. For F 2 we obtain: when /? < 1/2 then 
F 2 ~ r^ + 3, and when if /) > 1/2 then F 2 ~ r v ^. We have 
to look at which of the F\ or F 2 has a higher exponent. In 
these computation we can therefore just assume that 


M r (/3r log(r)) =JF (/3 > 2) . (84) 

Now let us remind ( p78| ) while keeping x = /3rlog(r) 
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A flF(/3>2) „ 

Aspinodai - max | ^ log ^^ , P e R j 


2 r log(r) 


(85) 

To get the static transition we use (80 1 . Let us find the B 

( 86 ) 


that satisfies equation (80 1 we get 




/3r log(r) 

We deduce ft = 4 and therefore 

Astatic — 


ftr log(r) 
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4rlog(r) 


(87) 





















